Model Evaluation

Owner: Daniel Soukup - Created: 2025.11.01

In this final recipe, we run an in-depth evaluation of our best-performing model's predictions using a number of classification metrics and visuals. Note that model selection was technically handled by the Modeling notebook and the automated hyperparameter tuning flow. Here, our focus is to understand model accuracy in the general sense as well as specific shortcomings, to motivate future model (and data processing) iterations.

NOTE: the markdown in this notebook refers to our latest run; subsequent re-training/re-evaluation would likely change the exact numbers in the output cells.

Load Prediction Data

We load the prediction datasets for analysis.
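A minimal sketch of the load step. The artifact path and column names (`y_true`, `y_pred`, `y_prob`) are assumptions for illustration; adjust them to match the Modeling notebook's actual export.

```python
import pandas as pd

# Hypothetical path/schema -- adjust to the Modeling notebook's actual
# export location and column names:
# preds = pd.read_parquet("artifacts/predictions/best_model.parquet")

# Toy frame with the schema the rest of this analysis assumes:
preds = pd.DataFrame({
    "y_true": [0, 0, 1, 0, 1],            # ground-truth labels
    "y_pred": [0, 0, 1, 0, 0],            # hard labels at the 0.5 cutoff
    "y_prob": [0.1, 0.2, 0.8, 0.3, 0.4],  # predicted P(high income)
})
print(preds.describe())
```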

Prediction Statistics

Let's look at high-level statistics of the predictions first.

We can already see that while the real labels contain ~8% high income, our predictions were positive only ~3.5% of the time. However, the predicted probabilities are fairly well calibrated: their mean matches the true positive rate almost exactly. We will experiment with alternatives to the default 0.5 cutoff to find better precision/recall trade-offs later in the notebook.
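The comparison above can be sketched as follows. The arrays here are synthetic stand-ins shaped like our data (~8% positives, hard labels at the 0.5 cutoff), so the exact numbers differ from the notebook's run.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic stand-ins: ~8% positives, positives scored noticeably
# higher on average, hard labels taken at the default 0.5 cutoff.
y_true = (rng.random(n) < 0.08).astype(int)
y_prob = np.clip(rng.beta(1, 11, size=n) + 0.4 * y_true, 0.0, 1.0)
y_pred = (y_prob >= 0.5).astype(int)

print(f"true positive rate (base rate): {y_true.mean():.3f}")
print(f"predicted positive rate @ 0.5:  {y_pred.mean():.3f}")
print(f"mean predicted probability:     {y_prob.mean():.3f}")
```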

We can see the probability distribution for the true 0/1 labels. For class 0, the probabilities are nicely concentrated near 0, as expected; however, we do see a fair number of class 1 samples with low probabilities (i.e., instances misclassified by a large margin). Note the log scale on the y-axis.
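A numeric counterpart to that plot: bucket the predicted probabilities by true class. The data is the same synthetic stand-in as above, not the notebook's real predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.08).astype(int)
y_prob = np.clip(rng.beta(1, 11, size=10_000) + 0.4 * y_true, 0.0, 1.0)

# Bucket predicted probabilities by true class (10 bins of width 0.1).
bins = np.linspace(0.0, 1.0, 11)
for cls in (0, 1):
    counts, _ = np.histogram(y_prob[y_true == cls], bins=bins)
    print(f"class {cls}: {counts}")
# Class 0 concentrates in the lowest bins; in the real data, class 1
# also shows a low-probability tail (the large-margin errors) that the
# log y-axis makes visible.
```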

As future work, we could explore the segment of data where these misclassifications occur to better understand how to address the issue (e.g., whether certain groups are over-represented among the misclassified samples).

Calculate Metrics

We will look at the standard binary classification metrics using:
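A small worked example of the standard metrics on toy vectors (not our real predictions), using scikit-learn:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Toy vectors: 6 negatives and 4 positives; 2 TP, 1 FP, 2 FN, 5 TN.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 7/10 = 0.7
print("precision:", precision_score(y_true, y_pred))  # 2/3
print("recall   :", recall_score(y_true, y_pred))     # 2/4 = 0.5
print("f1       :", f1_score(y_true, y_pred))         # 4/7
```

With heavy class imbalance like ours, accuracy alone is misleading (predicting all zeros already scores ~92%), which is why we lean on precision/recall.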

Observations:

The confusion matrices give another view of the correct and misclassified samples:

The misclassified samples are off the diagonal:
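The diagonal/off-diagonal split can be computed directly (same toy vectors as before, not our real predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Diagonal = correct predictions; off-diagonal = FP + FN.
misclassified = cm.sum() - np.trace(cm)
print("misclassified:", misclassified)
```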

Precision-Recall Curves

As mentioned previously, we can consider adjusting the threshold behind our hard predictions to achieve a better precision-recall trade-off. We can see this on the precision-recall curve; note that we used the area under this curve as the objective to optimize when training our XGBoost models.

We could select an alternative threshold:

We can see that if we sacrifice some precision, the recall can be brought up: e.g., with a ~0.2-0.3 threshold we can achieve nearly equal precision and recall of around 0.5-0.6.
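One way to read an alternative cutoff off the curve is to find the threshold where precision and recall are closest to equal. The sketch below reuses the synthetic stand-in data, so the crossing point differs from the notebook's actual ~0.2-0.3.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.08).astype(int)
y_prob = np.clip(rng.beta(1, 11, size=10_000) + 0.4 * y_true, 0.0, 1.0)

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision/recall have one more entry than thresholds; drop the last
# point and pick the threshold minimizing |precision - recall|.
gap = np.abs(precision[:-1] - recall[:-1])
i = int(np.argmin(gap))
print(f"threshold={thresholds[i]:.3f}  "
      f"precision={precision[i]:.3f}  recall={recall[i]:.3f}")
```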

ROC-AUC

Finally, let's look at the ROC-AUC score and the ROC curve:

In the plots below we show the TPR against the FPR: ideally, we achieve a high TPR at low FPR values.

That is, ideally, the curve hugs the top-left corner. But again, this metric is not sensitive enough for imbalanced datasets.
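The ROC computation itself is a one-liner with scikit-learn; again, the data here is the synthetic stand-in, so the score is illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.08).astype(int)
y_prob = np.clip(rng.beta(1, 11, size=10_000) + 0.4 * y_true, 0.0, 1.0)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"ROC-AUC: {auc:.3f}")
# A random classifier scores ~0.5; a curve hugging the top-left corner
# scores near 1. With heavy imbalance the FPR denominator (all
# negatives) is large, so ROC-AUC can look optimistic relative to the
# precision-recall view above.
```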

Next Steps

To better understand model performance: